23 research outputs found

    Finding regulatory DNA motifs using alignment-free evolutionary conservation information

    As an increasing number of eukaryotic genomes are being sequenced, comparative studies aimed at detecting regulatory elements in intergenic sequences are becoming more prevalent. Most comparative methods for transcription factor (TF) binding site discovery make use of global or local alignments of orthologous regulatory regions to assess whether a particular DNA site is conserved across related organisms, and thus more likely to be functional. Since binding sites are usually short, sometimes degenerate, and often independent of orientation, alignment algorithms may not align them correctly. Here, we present a novel, alignment-free approach for using conservation information for TF binding site discovery. We relax the definition of conserved sites: we consider a DNA site within a regulatory region to be conserved in an orthologous sequence if it occurs anywhere in that sequence, irrespective of orientation. We use this definition to derive informative priors over DNA sequence positions, and incorporate these priors into a Gibbs sampling algorithm for motif discovery. Our approach is simple and fast. It requires neither sequence alignments nor the phylogenetic relationships between the orthologous sequences, yet it is more effective on real biological data than methods that do
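    The relaxed definition of conservation in this abstract can be illustrated with a minimal sketch. The function names and the per-position score (the fraction of orthologous sequences that contain the site in either orientation) are assumptions made for illustration; the abstract does not say how the authors turn conservation into a prior.

```python
def reverse_complement(site: str) -> str:
    """Return the reverse complement of a DNA string over A, C, G, T."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(site))


def is_conserved(site: str, ortholog: str) -> bool:
    """Alignment-free test: the site occurs anywhere in the orthologous
    sequence, on either strand."""
    return site in ortholog or reverse_complement(site) in ortholog


def conservation_scores(reference: str, orthologs: list[str], width: int) -> list[float]:
    """For every width-mer start position in the reference, return the
    fraction of orthologs in which that width-mer is conserved.
    (Hypothetical scoring scheme, not taken from the paper.)"""
    scores = []
    for start in range(len(reference) - width + 1):
        site = reference[start:start + width]
        hits = sum(is_conserved(site, ortholog) for ortholog in orthologs)
        scores.append(hits / len(orthologs))
    return scores
```

    Scores of this kind could be normalized into a prior over positions; no alignment or phylogenetic tree is needed to compute them.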

    A Nucleosome-Guided Map of Transcription Factor Binding Sites in Yeast

    Finding functional DNA binding sites of transcription factors (TFs) throughout the genome is a crucial step in understanding transcriptional regulation. Unfortunately, these binding sites are typically short and degenerate, posing a significant statistical challenge: many more matches to known TF motifs occur in the genome than are actually functional. However, information about chromatin structure may help to identify the functional sites. In particular, it has been shown that active regulatory regions are usually depleted of nucleosomes, thereby enabling TFs to bind DNA in those regions. Here, we describe a novel motif discovery algorithm that employs an informative prior over DNA sequence positions based on a discriminative view of nucleosome occupancy. When a Gibbs sampling algorithm is applied to yeast sequence sets identified by ChIP-chip, the correct motif is found in 52% more cases with our informative prior than with the commonly used uniform prior. This is the first demonstration that nucleosome occupancy information can be used to improve motif discovery. The improvement is dramatic, even though we are using only a statistical model to predict nucleosome occupancy; we expect our results to improve further as high-resolution genome-wide experimental nucleosome occupancy data become increasingly available.
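    The role of a positional prior in a Gibbs sampling step can be sketched as follows. This is a simplified illustration, not the authors' algorithm: it assumes exactly one site per sequence, a fixed position weight matrix `pwm`, and a precomputed `prior` over start positions (for example, higher where predicted nucleosome occupancy is lower). All names are hypothetical.

```python
import random


def position_posterior(seq, pwm, background, prior):
    """Posterior over motif start positions in one sequence:
    prior[i] times the motif-vs-background likelihood ratio of the
    window starting at i, normalized to sum to 1."""
    width = len(pwm)
    weights = []
    for start in range(len(seq) - width + 1):
        ratio = 1.0
        for offset in range(width):
            base = seq[start + offset]
            ratio *= pwm[offset][base] / background[base]
        weights.append(prior[start] * ratio)
    total = sum(weights)
    return [w / total for w in weights]


def sample_position(posterior, rng):
    """Draw one start position according to the posterior."""
    return rng.choices(range(len(posterior)), weights=posterior, k=1)[0]


# Toy usage: a width-1 "motif" that prefers A, in a sequence with two A's.
pwm = [{"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}]
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
uniform = position_posterior("ACGA", pwm, background, [0.25] * 4)
informed = position_posterior("ACGA", pwm, background, [0.1, 0.1, 0.1, 0.7])
```

    With the uniform prior the two matching positions are equally likely; the informative prior shifts probability towards the position it favours, which is how such a prior can steer the sampler away from non-functional matches.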

    Towards a Complete Transcriptional Regulatory Code: Improved Motif Discovery Using Informative Priors

    Transcriptional regulation is the primary mechanism employed by the cell to ensure coordinated expression of its numerous genes. A key component of this process is the binding of proteins called transcription factors (TFs) to corresponding regulatory sites on the DNA. Understanding where exactly these TFs bind, under what conditions they are active, and which genes they regulate is all part of deciphering the transcriptional regulatory code. An important step towards solving this problem is the identification of DNA binding specificities, represented as motifs, for all TFs. In spite of an explosion of TF binding data from high-throughput technologies, the problem of motif discovery remains unsolved, due to the short length and degeneracy of binding sites.

    We introduce PRIORITY, a Gibbs sampling-based approach, which incorporates informative positional priors into a probabilistic framework, to find significant motifs from high-throughput TF binding data. We use different data sources to build our positional priors and apply them to yeast ChIP-chip data:

    * TFs can be classified into several structural classes based on their DNA-binding domains. Using a Bayesian learning algorithm, we show that it is possible to predict the class of a TF with remarkable accuracy, using information solely from its DNA binding sites. We further incorporate these results in the form of informative priors into PRIORITY, which learns the structural class of the TF in addition to its motif.

    * In the nucleus, DNA is present in the form of chromatin--wrapped around nucleosomes--with certain regions being more accessible to TFs than others. It has been shown that functional binding sites are generally located in nucleosome-free regions. We use nucleosome occupancy predictions to compute a novel positional prior that biases the search towards the more accessible regions, thereby enriching the motif signal.

    * Functional elements are often conserved across related species. Most conventional methods that exploit this fact use alignments. However, multiple alignments cannot always capture relocation and reversed orientation of binding sites across species. We propose a new alignment-free technique that not only accounts for these transformations, but is much faster than conventional methods.

    All our priors significantly outperform conventional methods, finding motifs matching literature for 52 TFs. We produce a genome-wide map of TF binding sites in yeast based on these and other novel motif predictions.

    Dissertation

    No Promoter Left Behind (NPLB): learn de novo


    DIVERSITY finds multiple modes in fly CTCF ChIP data.

    (a) 200 bp regions centered around the summit of ChIP peaks, input to DIVERSITY. (b) DIVERSITY reorders and realigns the data, revealing eight modes. (c) Motifs corresponding to the modes. CTCF motifs from JASPAR and from high-throughput SELEX [25] are shown below. (d) Sequence conservation profile from phastCons, corresponding to the nucleotides in (b). (e) The eight modes are displayed in order of decreasing ChIP score. (f) Violin plot of the distance of each sequence from the closest transcription start site (TSS). (g) Violin plot of expression values (log2(1+RPKM)) of genes with a TSS within 2000 bp of the ChIP region. The red line shows the median value across all measured genes. (h) Overlaps with Su(Hw) and Pita ChIP experiments, respectively.